52 ◾ Bioinformatics
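To count the total number of bases in the reference file, you can combine the “grep”, “wc”,
and “awk” commands. The “grep -v” command drops the header lines, “wc” reports the numbers
of lines, words, and characters in what remains, and “awk” subtracts the number of lines
(one newline character each) from the number of characters: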
grep -v ">" GRCh38.p13_ref.fna | wc | awk '{print $3-$1}'
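As a minimal sketch, the same count can be wrapped in a small shell function that strips the line endings explicitly instead of subtracting them, which also tolerates Windows-style line endings. The function name “count_bases” is hypothetical and is not part of any tool used in this chapter.

```shell
# count_bases FILE: print the number of sequence characters in a FASTA file.
# Header lines (starting with ">") are skipped, then newlines and carriage
# returns are deleted so that only sequence characters are counted.
count_bases() {
    grep -v '^>' "$1" | tr -d '\r\n' | wc -c | awk '{print $1}'
}
```

Running “count_bases GRCh38.p13_ref.fna” should print the same total as the one-line pipeline, provided the file uses Unix line endings.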
If you want to split the reference sequences into separate files, you can use the following
script. It creates the directory “chromosomes” and splits the main FASTA file into one FASTA
file per sequence, naming each file after the sequence identifier in its header line:
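mkdir chromosomes
cd chromosomes
# Split at every line containing ">"; the pieces are written as xx00, xx01, ...
csplit -s -z ../GRCh38.p13_ref.fna '/>/' '{*}'
# Rename each piece after the sequence ID in its header line
for i in xx* ; do
    n=$(sed 's/>// ; s/ .*// ; 1q' "$i")
    mv "$i" "$n.fa"
done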
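After splitting, it can be worth checking that no sequence was lost. The following is a minimal sketch, assuming the split files end in “.fa” and sit directly in the target directory; the function name “check_split” is hypothetical.

```shell
# check_split FASTA DIR: compare the number of header lines in FASTA with
# the number of .fa files in DIR; the return status is non-zero on mismatch.
check_split() {
    n_headers=$(grep -c '^>' "$1")
    n_files=$(find "$2" -maxdepth 1 -type f -name '*.fa' | wc -l | awk '{print $1}')
    echo "sequences: $n_headers, files: $n_files"
    [ "$n_headers" -eq "$n_files" ]
}
```

From the directory that holds the reference file, the check would be run as “check_split GRCh38.p13_ref.fna chromosomes”. A mismatch may mean that two header lines share the same identifier, so that one file overwrote another during renaming.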
The annotation files for a reference genome may also be needed for some of the steps in
the downstream analysis. You can download the annotation file as described above. An
annotation file describes where genetic elements, also called features, such as genes,
introns, and exons, are located in the genome sequence, giving the start and end
coordinates and the name of each feature. Annotation files are usually in GFF or GTF
format. GFF (General Feature Format) is a simple tab-delimited text format for describing
genomic features and mapping them to the reference sequence in the FASTA file. GTF (Gene
Transfer Format) is similar to GFF but has additional elements. Figure 2.3 shows the
first part of the human annotation file in GFF format. The first column of an annotation
file contains the chromosome name or the chromosome GenBank accession, and the features
and other annotations are in the remaining columns.
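Because the annotation file is tab-delimited, its content can be summarized with standard tools. The following is a minimal sketch, assuming an uncompressed file with nine tab-separated columns in which column 3 holds the feature type, as in GFF3 and GTF. The function name “count_feature_types” is hypothetical, and the file passed to it is whatever annotation file you downloaded.

```shell
# count_feature_types GFF: tally the values in column 3 (feature type),
# skipping comment and directive lines that start with "#".
# Output: one "count<TAB>type" line per type, most frequent first.
count_feature_types() {
    awk -F '\t' '!/^#/ && NF >= 9 { n[$3]++ }
                 END { for (t in n) print n[t] "\t" t }' "$1" |
        sort -k1,1nr -k2,2
}
```

The output gives a quick overview of which features, such as genes and exons, the file annotates and how many of each it contains.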
Both the FASTA reference file and (sometimes) its annotation file are required by the
alignment programs, called aligners for short, for mapping the reads in the FASTQ files to
FIGURE 2.2 Part of the FASTA sequence of the human reference genome.